Prosper is a peer to peer lending company that offers personal loans at low rates. These loans are unsecured, which means you do not have to put up any collateral (like a house or car) that could get taken away if you can’t make payments. Each loan is typically funded by multiple people all over the United States.[4]
During this exploratory data analyze, I will investigate about investment return, investment loss and the risk for investors.
The data set contains 113,937 observations with 81 variables on each loan, including loan amount, borrower rate, current loan status, borrower income, borrower employment status, borrower credit history, number of investors, prosper rating and the latest payment information. More information about the prosper loan dictionary is available at the web address: https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/edit?usp=sharing.
The first part of the analysis is demonstrate the dataset structure.
## 'data.frame': 113937 obs. of 83 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## $ year : chr "2007" "2014" "2007" "2012" ...
## $ CreditScore : num 650 690 490 810 690 ...
This part of the investigation will analyze the data distribution.
The investor is any person or organization who commits capital with the expectation of financial returns[1]. At Prosper Company, the loan can be fund by one or more investor.
I would like to know how much investor funds a loan?
The number of investor in a loan vary from 1 to 1189. On average, the loan has around 80 investors, but frequently just one investor funds a loan. The last plot with a positive skew distribution and a long tail showed the distribution from the number of investor by loan.
After July 2009, the variable estimated effective yield, estimated loss and estimated return was added to the dataset. Effective yield is equal to the borrower interest rate minus the servicing fee rate, minus estimated uncollected interest on charge-offs, plus estimated collected late fees. The estimated loss is the estimated principal loss on charge-offs.[8]
I would like to use these variables to find out how much is the estimated loss and estimated return for an investor?
A density plot with the estimated loss and effective yield in percentage will help us to answer this question.
I figured out both estimated loss and effective yield in percentage in the same plot will clarify that both variables show multimodal distribution. The effective yield covers a larger positive area from the plot than the estimate loss, what means it is more likely it get an effective yield greater than estimated loss. On average, the estimated loss is 8.03% and the effective yield is 16.87.
Estimated return variable is the difference between the Estimated Effective Yield and the Estimated Loss Rate.[8] A plot from this variable will go to the heart of the last question.
The histogram from the estimated return shows good news. Loan founders can have great returns. Analyzing the estimated return distribution, it is a normal distribution with frequently returns in the positive area from the plot what means it is likely the investor will have a positive return from their investment. The average return is around 9.61%.
The minimum is -18.27 and it is not superior to 28.37 percent by year and 50% of the loan has estimated return varying from 7.408% to 11.66%.
Credit management is the process of granting credit, the terms it’s granted on and recovering this credit when it’s due. This is the function within a bank or company to control credit policies that will improve revenues and reduce financial risks.[10]
One of the key question for an investor is to figure out the investment risk level and the dataset has rating variable that should answer the question: How much is the investiment risk when the investor fund a loan?.
The prosper score and the prosper rating were built after 2009 to custom risk using historical proper data. For prosper score, the score ranges from 1-10, with 10 being the best, or lowest risk score. Prosper rating is categorized in this way: 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA, where HR is the riskier and AA is the safer category.[8]
An interesting point about these plots, prosper score plot has lower risk greater than higher risk and prosper rating is inverse. What is the most trustworthy variable?.
The dataset also store the lower and upper credit score provided by a consumer credit rating agency. The next plot will show the distribution from the lower divided by the upper range. Closer the one, closer is the difference between them.
The result is a skinny distribution close to 1, what means the upper and the lower value is very close, what also mean the average won’t have significant bias.
I’ve included the average between the upper and lower credit score with the name credit score in the dataset.
Lower credit score means higher risk and higher credit score means safer loans. The prosper rating, prosper score and credit score plot show a normal distribution where the amount of loans for higher and lower risk are lower than for the intermediary score.
This variable allow me to analyze the company actions after a great crise, like 2008 financial crisi.
The main change is the left side of the distribution. Before 2008, borrowers got loan with credit score lower than 500 and after 2008, clearly the minimum score was around 600.
Summary with loans until 2008.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.5 609.5 649.5 653.6 709.5 889.5
Summary with loans after 2008.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 609.5 669.5 709.5 708.9 729.5 889.5
There are a huge difference between the minimum score before and after 2008.
What is the most reliable score?
Lower is the number of default loans in the safer loan level and higher is around the higher risk level, the more reliable is that score.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Ignoring unknown parameters: binwidth, bins, pad
With exception of the prosper rating, the distribution is close to normal. However, for this distribution, the expected is the number of loans increases with the risk level. Just the prosper rating was close to the expected.
###Loan
A loan is the lending of money from one individual, organization or entity to another individual, organization or entity at an interest rate.[9]
I think, Prosper business will be good for investor if the numbers of loans increase and the number of default loans be close to zero.
From 2009 to 2013, the amount of loans has strongly increases. 2014, the data was collected until March, consequently there was a sharp fall.
It shows the confidence from the investors and borrowers have in the business.
Until 2008, the number of default loans was higher, and after that, the credit management has improved.
Income is the money that the borrower receives in exchange for providing a good or service[3]. The Prosper dataset stores the income range of the borrower at the time the listing was created. The income verifiable variable indicates the borrower have the required documentation to support their income.
The income from the most part of the borrower vary from 25,000 to 74,999, according the range income distribution plot. The dataset also shows, 92.39% has document to support their income.
There were some question about the investment risk when I started the investigation. The key questions were How much investment return can a investor get? How much can they lose? What are the risks? Are there any classification to mesure the risk?
During this initial investigation, I figured out variables that could answer these questions.
A brief analyze about the first part of the investigation.
The data set contains 113,937 observations with 81 variables on each loan, including loan amount, borrower rate, current loan status, borrower income, borrower employment status, borrower credit history, number of investors, prosper rating and the latest payment information. There are 9 categorical variable, 3 variables is key to other database table, and 63 are integer or numeric variable.
I’m looking for a feature to measure the investidor risk. This first investigation, I figured out the default loan happens with higher frequency in the high-risk than the safer prosper rate level definition. It means, the prosper rating is the main feature because it can measure the investiment risk.
The features will help the investigation are: Loan status, estimated return, estimated loss, prosper score, credit score, the borrower income range, and their employment status.
Yes, I create two new features: year and credit score. Year is the year the loan was originated and the creditscore is the average of upper and lower range from the borrower credit score provided by a customer credit rating agency.
No, there weren’t any usual distribution. I performed operation to adjust the data. For income range, I considered as not employed the borrower with 0 range income. The reason, for my analyze, there was not differece from the 0 income and not employed income range.
During the first part of the investigation, we figured out the amount of default loan is higher for high-risk investment definition.
We will go deep to understand the relationship between the ratings and the estimated return, estimated loss, default and delinquency loans.
First, the ggapairs gets a quickly overview from the impact on the estimated loss and estimated yield return variable caused by the different ratings variable.
I would like to know what happens with the estimated loss when the risk vary from low to higher.
The prosper rating and estimated loss variable will be plotted together in a scatter plot.
Evidently, there is a strong relationship between them. Lower is the risk, lower is the estimated return and higher is the risk, higher is the estimated loss.
Instead of prosper rating, a similar plot will be created with the credit score and prosper score variable.
The relationship between credit score, prosper rating by estimated loss is not strong. 5% loss is found in prosper score varying from 3 to 10. The same happens with credit score, a credit score of 750 can be a loss varying from 1% to 15%.
What happens with the estimated return when the risk change?
Not all the plots show a strong relationship between the risk and the estimated return. The prosper rating plot better explains the estimated return behavior. The intermediate risk, like C and D, brings the higher estimated return.
What is the risk level choice by investors?
The plot shows the loans with less than 100 investors risk more, however more the number of investor, safer is the rating choice.
What is the loan amount available for borrowing according the risk?
A similar behavior from the investor is found when the variable is loan amount.
The safer the rating the higher is the amount. For value lower than 5 thousand, the concentration is distributed similarly among the prosper rating with exception the AA prosper rating that shows a low concentration. When the loan amount increments 5 thousand, the HR rating is abrupt reduced and the same happens with E rating when the loan amount is higher than 10 thousand, and with D when it is higher than 15 thousand. For the highest loan amount, the concetration is around A and B ratings.
What happen with the loss and return investment according the income range?
The next two plots show the impact of the income on the estimated loss and estimated return.
The previous plot shows a box plot from the estimated loss by income range. The first quartile, the third quartile and the mean(red point inside the box plot) decrease as the income range increase.
The same behavior from estimated loss by income range is found when we compare the estimated return by income range. The first, third quartile and the mean from estimated return variable decrease as the income range variable increase.
How much is the loss loan by risk level?
Until now, the analysis was about the estimated loss. The next plot will tell the amount of default loan by LP_NetPrincipalLoss, the principal that remains uncollected after any recoveries.
Riskier is the prosper rating, higher is the amount of the default loan. There is a amount limit available by loan according the risk, and because of this, the amount default loan for HR risk level is lower than A. How much is the borrower rate according the risk level?
This plot shows a strong relationship between the borrower rate and the risk level, the borrower rate increase when the risk level increase.
What is the best risk classifications?
After analyze different risk classification, the prosper rating feature showed the estimated loss, estimated return and the amount of default loans is going to be lower for safer classification than for higher risk definition.
What happens to investment return when the risk classification rate vary?
On average, the estimated return increase when the classification is riskier, but, for the higher risk classification, the average decrease and the estimated loss increase.
How can I be sure the classification works?
The amount of default loan by risk classification showed the riskier classification has higher concentration of default loan, what means, the loss estimation reflects the amount of default loan.
I found two strong relationship, estimated loss by prosper rating and the borrower rate by by prosper rating.
Do higher incomes range change the borrower risk classification and the estimated loss?
The number of loans increase as the income range increase. The higher risk highlighted to both not employed borrower and income range below 24,000, what means the estimated loss is higher for borrower with low-income range. The borrower with income rate between 24,999, and 74,999 has the estimated loss and prosper rating equally distributed. For income range above 74,999 the borrowers fit the safer classification and the estimated loss decrease.
How much is the loss and deliquency loan by risk level?
After add the delinquent loans at the plot, the concentration became more evident in the riskier prosper rating. I would like to know if the numbers of loans by prosper rating has increase over the years
After 2008, for safer and intermediary risk classification, AA to D, the numbers and amount loan has increased and for riskier classification, keep the same.
What is the risk preference when the investor fund a loan over the years?
From 2009 to 2014, the number of loan with more than 250 investor has increased. As already analyzed before, loan with more than 100 investors, prefer safer risk classification.
How much is the borrower rate and estimated loss according the risk level?
Borrowers that fit low risk classification has its rates and the estimated loss lower. As the risk classification increase, the borrower rate has a higher variation at high level than borrower that fit low risk classification.
The most important feature is the prosper rating. It has strong relationship with keys business variables, like borrower rate, estimated loss and with less intensity, the estimated return.
I was surprised with the Estimated loss by income range coloured grouped prosper rating plot. I thought the borrower with high income range will fit safer risk classification, however the plot showed for salary above 24,999 the classification is almost the same.
Before the analysis, I thought the most important features of the dataset were the estimated loss and the estimated return. During the analysis, I figured out the most important feature is the prosper rating. The next 3 plots was carefully chosen to explain the prosper rating importance.
The first plot is about the estimated loss by prosper rating. The prosper rating can predict the estimated loss. Prosper rating with lower risk has lower estimated loss, as the risk increase the estimated loss also increase.
This second plot is about the estimated return by prosper rating. On average, higher is the risk, higher is the estimated return. The exception is the higher risk classification, the average is lower than the E level, and there were outliers in the negative estimated return area of the plot.
The third plot confirm the prosper rating works. The amount of delinquent or default loans concentrate in the riskier levels.
Prosper rating is the most important feature because it allows the investor measure the investment risk. Based in their characteristic, the investor can choose better return and higher risk as well as lower return and lower risk.
For an investor who wants guarantee the prosper rating works, they can show the default and delinquent loan distribution by prosper rating and they will see riskier the concentration of default and delinquent loan is in the riskier level.
Work in this project was amazing. I really learned a lot with my many mistakes. The first one, I didn’t know the company and the business rule, and I didn’t study about it before analysis the data. The second, I didn’t know what question I should ask to the data. And then, the worst of them, I tried to follow similar work available on Internet about the same dataset. After spend around 40 hours of work, many histogram available and no understand about what I was doing, I researched about the company on the internet, I read many reviews about the company, explanation about how to get a loan and how to be a investor, and information about its business at www.prosper.com.
After the research, I drove my effort to know what question I should ask to the dataset. Carefully, I read the description of each variable available at the loan dataset dictionary[8] and the first doubt showed up, why should a investor fund a loan? This doubt, drove me to many question: How much is the return or loss an investor can have? Are there a classification that differ the safer from the riskier loan? How many investor can a loan have?
The questions made the work easier. There were 82 variables available, however fewer of them could answer these questions.
I figured out the variable Prosper rating was one of the most important feature. This variable has strong relationship with the estimated loss, borrower rate, the most part of the default loan were found in the risker level definition and a less strong relationship with the estimated return. It is a perfect ruler to drive investor who prefers riskier loan and higher return from investor who prefers to invest in safer loans.
It was interesting to know the higher risk don’t give the higher return. Investor can get higher return with intermediary classification risk.
Surprised me the difference from the borrower rate charged by a US Company and the rate charge by a Brazilian companies. The Brazilian borrower rate is around 6 time higher.
For future work, I would like to develop a model to predict the rate classification a borrower fits and create a model to compare the estimated return, estimated loss and the classification ruler and how the default loan fit inside the ruler with other investment options available on the market and verify the best options.
[1]https://www.investopedia.com/terms/i/investor.asp
[2]https://www.thebalance.com/are-long-term-or-short-term-investments-better-2385918
[3]https://www.investopedia.com/terms/i/income.asp#ixzz58PVUEsPQ
[4]http://www.lendingmemo.com/prosper-loan-complaints/
[5]https://www.investopedia.com/terms/d/delinquency-rate.asp
[6]https://www.prosper.com/invest
[7]https://en.wikipedia.org/wiki/Debt-to-income_ratio
[8]https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/edit#gid=0
[9]https://en.wikipedia.org/wiki/Loan
[10]https://en.wikipedia.org/wiki/Credit_manager